In this study, we examined yeast proteins by two-dimensional (2D) gel electrophoresis and gathered quantitative
information from about 1,400 spots. We found that there is an enormous range of protein abundance
and, for identified spots, a good correlation between protein abundance, mRNA abundance, and codon bias. 
For each molecule of well-translated mRNA, there were about 4,000 molecules of protein. The relative 
abundance of proteins was measured in glucose and ethanol media. Protein turnover was examined and found 
to be insignificant for abundant proteins. Some phosphoproteins were identified. The behavior of proteins in 
differential centrifugation experiments was examined. Such experiments with 2D gels can give a global view of
the yeast proteome. 



The sequence of the yeast genome has been determined (9). 
More recently, the number of mRNA molecules for each ex- 
pressed gene has been measured (27, 30). The next logical level 
of analysis is that of the expressed set of proteins. We have 
begun to analyze the yeast proteome by using two-dimensional 
(2D) gels. 

2D gel electrophoresis separates proteins according to iso- 
electric point in one dimension and molecular weight in the 
other dimension (21), allowing resolution of thousands of pro- 
teins on a single gel. Although modern imaging and computing 
techniques can extract quantitative data for each of the spots in 
a 2D gel, there are only a few cases in which quantitative data 
have been gathered from 2D gels. 2D gel electrophoresis is 
almost unique in its ability to examine biological responses 
over thousands of proteins simultaneously and should there- 
fore allow us a relatively comprehensive view of cellular me- 
tabolism. 

We and others have worked toward assembling a yeast pro- 
tein database consisting of a collection of identified spots in 2D 
gels and of data on each of these spots under various condi- 
tions (2, 7, 8, 10, 23, 25). These data could then be used in 
analyzing a protein or a metabolic process. Saccharomyces 
cerevisiae is a good organism for this approach since it has a 
well-understood physiology as well as a large number of mu- 
tants, and its genome has been sequenced. Given the sequence 
and the relative lack of introns in S. cerevisiae, it is easy to 
predict the sequence of the primary protein product of most 
genes. This aids tremendously in identifying these proteins on 
2D gels. 

There are three pillars on which such a database rests: (i) 
visualization of many protein spots simultaneously, (ii) quan- 
tification of the protein in each spot, and (iii) identification of 
the gene product for each spot. Our first efforts at visualization 
and identification for S. cerevisiae have been described else- 
where (7, 8). Here we describe quantitative data for these 
proteins under a variety of experimental conditions. 

MATERIALS AND METHODS 
Strains and media. S. cerevisiae W303 (MATa ade2-1 his3-11,15 leu2-3,112
trp1-1 ura3-1 can1-100) was used (26). -Met YNB (yeast nitrogen base) medium
was 1.7 g of YNB (Difco) per liter, 5 g of ammonium sulfate per liter, and
adenine, uracil, and all amino acids except methionine; -Met -Cys YNB medium
was the same but without methionine or cysteine. Medium was supplemented
with 2% glucose (for most experiments) or with 2% ethanol (for ethanol
experiments). Low-phosphate YEPD was described by Warner (28). 

Isotopic labeling of yeast and preparation of cell extracts. Yeast strains were
labeled and proteins were extracted as described by Garrels et al. (7, 8). Briefly,
cells were grown to 5 × 10^6 cells per ml at 30°C; 1 ml of culture was transferred
to a fresh tube, and 0.3 mCi of [35S]methionine (e.g., Express protein labeling
mix; New England Nuclear) was added to this 1-ml culture. The cells were
incubated for a further 10 to 15 min and then transferred to a 1.5-ml microcentrifuge
tube, chilled on ice, and harvested by centrifugation. The supernatant was
removed, and the cell pellet was resuspended in 100 µl of lysis buffer (20 mM
Tris-HCl [pH 7.6], 10 mM NaF, 10 mM sodium pyrophosphate, 0.5 mM EDTA,
0.1% deoxycholate; just before use, phenylmethylsulfonyl fluoride was added to
1 mM, leupeptin was added to 1 µg/ml, pepstatin was added to 1 µg/ml,
tosylsulfonyl phenylalanyl chloromethyl ketone was added to 10 µg/ml, and soybean
trypsin inhibitor was added to 10 µg/ml).

The resuspended cells were transferred to a screw-cap 1.5-ml polypropylene 
tube containing 0.28 g of glass beads (0.5-mm diameter, Biospec Products) or 
0.40 g of zirconia beads (0.5-mm diameter, Biospec Products). After the cap was 
secured, the tube was inserted into a MiniBeadbeater 8 (Biospec Products) and 
shaken at medium high speed at 4°C for 1 min. Breakage was typically 75%.
Tubes were then spun in a microcentrifuge for 10 s at 5,000 × g at 4°C.

With a very fine pipette tip, liquid was withdrawn from the beads and transferred
to a prechilled 1.5-ml tube containing 7 µl of DNase I (0.5 mg/ml; Cooper
product no. 6330)-RNase A (0.25 mg/ml; Cooper product no. 5679)-Mg (50 mM
MgCl2) mix. Typically 70 µl of liquid was recovered. The mixture was incubated
on ice for 10 min to allow the RNase and DNase to work. 

Next, 75 µl of 2× dSDS (2× dSDS is 0.6% sodium dodecyl sulfate [SDS], 2%
mercaptoethanol, and 0.1 M Tris-HCl [pH 8]) was added. The tube was plunged
into boiling water, incubated for 1 min, and then plunged into ice. After cooling,
the tube was centrifuged at 4°C for 3 min at 14,000 × g. The supernatant was
transferred to a fresh tube and frozen at -70°C. About 5 µl of this supernatant
was used for each 2D gel.

2D polyacrylamide gels. 2D gels were made and run as described elsewhere
(6-8). 

Image analysis of the gels. The Quest II software system was used for quan- 
titative image analysis (20, 22). Two techniques were used to collect quantitative 
data for analysis by Quest II software. First, before the advent of phosphorim- 
agers, gels were dried and fluorographed. Each gel was exposed to film for three 
different times (typically 1 day, 2 weeks, and 6 weeks) to increase the dynamic 
range of the data. The films were scanned along with calibration strips to relate 
film optical density to disintegrations per minute in the gels and analyzed by the 
software to obtain a linear relationship between disintegrations per minute in the 
spots and optical densities of the film images. The quantitative data are ex- 
pressed as parts per million of the total cellular protein. This value is calculated 
from the disintegrations per minute of the sample loaded onto the gel and by 
comparing the film density of each data spot with density of the film over the 
calibration strips of known radioactivity exposed to the same film. This yields the 
disintegrations per minute per millimeter for each spot on the gel and thence its 
parts-per-million value.
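
The conversion from film density to a parts-per-million value can be sketched as follows. This is a minimal illustration and not the Quest II implementation: the function names and all numbers are hypothetical, and an ordinary least-squares line stands in for the calibration-strip fit.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def spot_ppm(spot_dpm, total_dpm_loaded):
    """Express one spot as parts per million of the radioactivity loaded."""
    return 1e6 * spot_dpm / total_dpm_loaded


# Hypothetical calibration strips: optical density read for known dpm values.
strip_od = [0.1, 0.2, 0.4, 0.8]
strip_dpm = [100.0, 200.0, 400.0, 800.0]
slope, intercept = fit_line(strip_od, strip_dpm)

# Hypothetical spot with an integrated optical density of 0.5 on the same film.
spot_dpm = slope * 0.5 + intercept
print(round(spot_ppm(spot_dpm, total_dpm_loaded=1_000_000)))
```

Exposing each gel for several times, as described above, keeps both weak and strong spots inside the range where such a linear calibration holds.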

After the advent of phosphorimaging, gels bearing 35S-labeled proteins were
exposed to phosphorimager screens and scanned by a Fuji phosphorimager,
typically for two exposures per gel. Calibration strips of known radioactivity were
exposed simultaneously. Scan data from the phosphorimager were assimilated by
Quest II software, and quantitative data were recorded for the spots on the gels.

Measurements of protein turnover. Cells in exponential phase were pulse-labeled
with [35S]methionine, excess cold Met and Cys were added, and samples
of equal volume were taken from the culture at intervals up to 90 min (in one
experiment) or up to 160 min (in a second experiment). Incorporation of 35S into
protein was essentially 100% by the first sample (10 min). Extracts were made,
and equal fractions of the samples were loaded on 2D gels (i.e., the different
samples had different amounts of protein but equal amounts of 35S). Spots were
quantitated with a phosphorimager and Quest software.

The software was queried for spots whose radioactivity decreased through the 
time course. The algorithm examined all data points for all spots, drew a best-fit 
line through the data points, and looked for spots where this line had a statis- 
tically significant negative slope. In one of the experiments, there was one such 
spot. To the eye, this was a minor, unidentified spot seen only in the first two 
samples (10 and 20 min). In the other experiment, the Quest software found no 
spots meeting the criteria. Therefore, we concluded that all of the identified
spots (and all but one of the visible spots) represented proteins with long
half-lives.
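
The slope test described above can be sketched as follows. This is an illustration of the general technique (a least-squares line with a one-sided t test on its slope) and not the Quest software's own algorithm; the function names, the example counts, and the choice of a 5% significance level are assumptions.

```python
import math


def slope_t_statistic(times, counts):
    """Return (slope, t) for a least-squares line through (time, count) points."""
    n = len(times)
    mean_t = sum(times) / n
    mean_c = sum(counts) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, counts))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_t
    residual_ss = sum((c - (intercept + slope * t)) ** 2
                      for t, c in zip(times, counts))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / std_err


def is_decaying(times, counts, t_critical):
    """Flag a spot whose best-fit slope is significantly negative (one-sided)."""
    _, t_stat = slope_t_statistic(times, counts)
    return t_stat < -t_critical


# Hypothetical chase times (min) and spot counts; not data from this study.
times = [10, 20, 30, 60, 90]
stable_spot = [1000, 1010, 990, 1005, 995]
unstable_spot = [1000, 800, 650, 330, 170]

# One-sided 5% critical value of Student's t for n - 2 = 3 degrees of freedom.
T_CRIT = 2.353

print(is_decaying(times, stable_spot, T_CRIT))
print(is_decaying(times, unstable_spot, T_CRIT))
```

Because equal amounts of 35S were loaded from each sample, a stable protein gives a flat line in such a test, and only a spot that loses label faster than the bulk is flagged.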

Centrifugal fractionation. Cells were labeled, harvested, and broken with glass 
beads by the standard method described above except that no detergent (i.e., no 
deoxycholate) was present in the lysis buffer. The crude lysate was cleared of
unbroken cells and large debris by centrifugation at 300 × g for 30 s. The
supernatant of this centrifugation was then spun at 16,000 × g for 10 min to give
the pellet used for Fig. 6B. The supernatant of the 16,000 × g, 10-min spin was
then spun at 100,000 × g for 30 min to give the supernatant used for Fig. 6A.

Protein abundance calculations. A haploid yeast cell contains about 4 × 10^-12
g of protein (1, 15). Assuming a mean protein mass of 50 kDa, there are about
50 × 10^6 molecules of protein per cell. There are about 1.8 methionines per 10
kDa of protein mass, which implies 4.5 × 10^8 molecules of methionine per cell
(neglecting the small pool of free Met). We measured (i) the counts per minute
in each spot on the 2D gels, (ii) the total number of counts on each gel (by
integrating counts over the entire gel), and (iii) the total number of counts
loaded on the gel (by scintillation counting of the original sample). Thus, we
know what fraction of the total incorporated radioactivity is present in each spot.
After correcting for the methionine (and cysteine [see below]) content of each
protein, we calculated an absolute number of protein molecules based on the
fraction of radioactivity in each spot and on 50 × 10^6 total molecules per cell.

The labeling mixture used contained about one-fifth as much radioactive 
cysteine as radioactive methionine. Therefore, the number of cysteine molecules 
per protein was also taken into account in calculating the number of molecules 
of protein, but Cys molecules were weighted one-fifth as heavily as Met mole- 
cules. 
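
The arithmetic in the two paragraphs above can be written out as a short sketch. The constants are those given in the text; the function name `copies_per_cell`, the example spot, and the `mean_cys_per_protein` argument are hypothetical, since the average cysteine content used in the study is not stated here.

```python
AVOGADRO = 6.022e23
PROTEIN_PER_CELL_G = 4e-12      # grams of protein per haploid cell (from the text)
MEAN_PROTEIN_MASS_DA = 50_000   # assumed mean protein mass (from the text)
MET_PER_10_KDA = 1.8            # methionines per 10 kDa (from the text)
TOTAL_MOLECULES = 50e6          # rounded value used in the text
CYS_WEIGHT = 0.2                # Cys label is about one-fifth of the Met label


def estimated_molecules_per_cell():
    """Protein molecules per cell from total protein mass and mean mass."""
    return PROTEIN_PER_CELL_G / MEAN_PROTEIN_MASS_DA * AVOGADRO


def copies_per_cell(fraction_of_counts, n_met, n_cys, mean_cys_per_protein):
    """Copies of one protein, corrected for its Met and Cys content.

    mean_cys_per_protein is a hypothetical input, not a value from the text.
    """
    mean_met = MET_PER_10_KDA * MEAN_PROTEIN_MASS_DA / 10_000  # 9 per protein
    mean_label = mean_met + CYS_WEIGHT * mean_cys_per_protein
    protein_label = n_met + CYS_WEIGHT * n_cys
    return fraction_of_counts * TOTAL_MOLECULES * mean_label / protein_label


print(f"{estimated_molecules_per_cell():.2e}")  # close to 50 million
# Hypothetical spot holding 1% of the counts, from a protein with 9 Met.
print(round(copies_per_cell(0.01, n_met=9, n_cys=0, mean_cys_per_protein=0)))
```

A protein with more methionines than average carries more label per molecule, so the same fraction of counts corresponds to fewer copies; the correction term `mean_label / protein_label` accounts for this.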

mRNA abundance calculations. For estimation of mRNA abundance, we used 
SAGE (serial analysis of gene expression) data (27) and Affymetrix chip hybrid- 
ization data (29a, 30). The mRNA column in Table 1 shows mRNA abundance 
calculated from SAGE data alone. However, the SAGE data came from cells 
growing in YEPD medium, whereas our protein measurements were from cells 
growing in YNB medium. In addition, SAGE data for low-abundance mRNAs 
suffers from statistical variation. Therefore, we also used chip hybridization data 
(29a, 30) for mRNA from cells grown in YNB. These hybridization data also had 
disadvantages. First, the amounts of high-abundance mRNAs were systemati- 
cally underestimated, probably because of saturation in the hybridizations, which 
used 10 µg of cRNA. For example, the abundance of ADH1 mRNA was 197
copies per cell by SAGE but only 32 copies per cell by hybridization, and the
abundance of ENO2 mRNA was 248 copies per cell by SAGE but only 41 by
hybridization. When the amount of cRNA used in the hybridization was reduced
to 1 µg, the apparent amounts of mRNA were similar to the amounts determined
by SAGE (29a, 29b). However, experiments using 1 µg of cRNA have been done
for only some genes (29a). Because amounts of mRNA were normalized to 
15,000 per cell, and because the amounts of abundant mRNAs were underesti- 
mated, there is a 2.2-fold overestimate of the abundance of nonabundant 
mRNAs. We calculated this factor of 2.2 by adding together the number of 
mRNA molecules from a large number of genes expressed at a low level for both 
SAGE data and hybridization data. The sum for the same genes from hybrid- 
ization data is 2.2-fold greater than that from SAGE data. 

To take into account these difficulties, we compiled a list of "adjusted" mRNA
abundance as follows. For all high-abundance mRNAs of our identified proteins, 
we used SAGE data. For all of these particular mRNAs, chip hybridization 
suggested that mRNA abundance was the same in YEPD and YNB media. For 
medium-abundance mRNAs, SAGE data were used, but when hybridization 
data showed a significant difference between YEPD and YNB, then the SAGE 
data were adjusted by the appropriate factor. Finally, for low-abundance 
mRNAs, we used data from chip hybridizations from YNB medium but divided 
by 2.2 to normalize to the SAGE results. These calculations were completed 
without reference to protein abundance. 
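
The adjustment rules above can be summarized in a short sketch. The abundance class is passed in explicitly because the cutoffs between high, medium, and low abundance are not given in the text; the function names and the `ynb_over_yepd` argument are hypothetical.

```python
HYBRIDIZATION_SCALE = 2.2  # hybridization/SAGE ratio for low-abundance mRNAs


def scale_factor(sage_counts, hybridization_counts):
    """Ratio of summed hybridization signal to summed SAGE signal."""
    return sum(hybridization_counts) / sum(sage_counts)


def adjusted_mrna(abundance_class, sage=None, hyb_ynb=None, ynb_over_yepd=1.0):
    """Adjusted mRNA copies per cell for one gene.

    abundance_class: "high", "medium", or "low" (cutoffs are not specified
    in the text, so the caller decides the class).
    ynb_over_yepd: hybridization-based YNB/YEPD ratio, applied only when the
    difference between media was judged significant.
    """
    if abundance_class == "high":
        return sage
    if abundance_class == "medium":
        return sage * ynb_over_yepd
    if abundance_class == "low":
        return hyb_ynb / HYBRIDIZATION_SCALE
    raise ValueError(f"unknown abundance class: {abundance_class}")


print(adjusted_mrna("high", sage=197))   # ADH1 SAGE value from the text
print(adjusted_mrna("low", hyb_ynb=2.2))  # hypothetical hybridization value
```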

CAI. The codon adaptation index (CAI) was taken from the yeast proteome
database (YPD) (13), for which calculations were made according to Sharp and 
Li (24). Briefly, the index uses a reference set of highly expressed genes to assign 
a value to each codon, and then a score for a gene is calculated from the 
frequency of use of the various codons in that gene (24). 
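
As an illustration of how such an index works, the sketch below computes a CAI as the geometric mean of per-codon weights, where each weight is the codon's count in the reference set divided by the count of the most used synonymous codon. The reference counts and the two codon families are hypothetical toy values, not the YPD reference set, and the function names are assumptions.

```python
import math


def relative_adaptiveness(reference_counts, synonymous_families):
    """Weight each codon by its count relative to the most used synonym."""
    weights = {}
    for family in synonymous_families:
        top = max(reference_counts[codon] for codon in family)
        for codon in family:
            weights[codon] = reference_counts[codon] / top
    return weights


def cai(codons, weights):
    """Geometric mean of the weights of a gene's codons.

    Codons without a weight (e.g., single-codon amino acids) are skipped.
    A codon with a zero reference count would need special handling.
    """
    scored = [weights[c] for c in codons if c in weights]
    return math.exp(sum(math.log(w) for w in scored) / len(scored))


# Hypothetical reference usage for two amino acids (Lys and Glu).
families = [["AAA", "AAG"], ["GAA", "GAG"]]
reference = {"AAA": 10, "AAG": 90, "GAA": 80, "GAG": 20}
w = relative_adaptiveness(reference, families)

print(cai(["AAG", "GAA", "AAG"], w))  # gene using only preferred codons
print(cai(["AAA", "GAG"], w))         # gene using only rare codons
```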

Statistical analysis. The JMP program was used with the aid of T. Tully. The
JMP program showed that neither mRNA nor protein abundances were normally
distributed; therefore, Spearman rank correlation coefficients (r_s) were
calculated. The mRNA (adjusted and unadjusted) and protein data were also
transformed so that Pearson product-moment correlation coefficients (r_p) could
be calculated. First, this was done by a Box-Cox transformation of log-transformed
data. This transformation produced normal distributions, and an r_p of
0.76 was achieved. However, because the Box-Cox transformation is complex, we
also did a simpler logarithmic transformation. This produced a normal distribution
for the protein data. However, the distribution for the mRNA and adjusted
mRNA data was close to, but not quite, normal. Nevertheless, we calculated the
r_p and found that it was 0.76, identical to the coefficient from the Box-Cox
transformed data. We therefore believe that this correlation coefficient is not
misleading, despite the fact that the log(mRNA) distribution is not quite normal.
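
The two coefficients can be illustrated with a small, self-contained sketch. It uses only six rows of Table 1 (adjusted mRNA and protein in glucose), so it will not reproduce the coefficients reported for the full data set; the helper names are hypothetical, and the study itself used the JMP program.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Adh1, Eno2, Fba1, Pgi1, Pfk1, Cit2 from Table 1.
adjusted_mrna = [197, 248, 179, 14, 5, 2.8]
protein_thousands = [1230, 650, 640, 160, 75, 23]

r_s = spearman(adjusted_mrna, protein_thousands)
r_p = pearson([math.log10(x) for x in adjusted_mrna],
              [math.log10(y) for y in protein_thousands])
print(round(r_s, 3), round(r_p, 3))
```

The rank-based coefficient makes no assumption about the distribution, which is why it was the first choice for data that were not normally distributed; the log transformation is what makes the Pearson coefficient appropriate.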



RESULTS 

Visualization of 1,400 spots on three gel systems. Yeast 
proteins have isoelectric points ranging from 3.1 to 12.8, and 
masses ranging from less than 10 kDa to 470 kDa. It is difficult 
to examine all proteins on a single kind of gel, because a gel 
with the needed range in pI and mass would give poor resolution
of the thousands of spots in the central region of the gel.
Therefore, we have used three gel systems: (i) pH "4 to 8" with 
10% polyacrylamide; (ii) pH "3 to 10" with 10% polyacryl- 
amide; and (iii) nonequilibrium with 15% polyacrylamide (7, 
8). Each gel system allows good resolution of a subset of yeast 
proteins. 

Figure 1 shows a pH 4-8, 10% polyacrylamide gel. The pH 
at the basic end of the isoelectric focusing gel cannot be main- 
tained throughout focusing, and so the proteins resolved on 
such gels have isoelectric points between pH 4 and pH 6.7. For 
these pH 4-8 gels, we see 600 to 900 spots on the best gels after 
multiple exposures. 

The pH 3-10 gels (not shown) extend the pI range somewhat
beyond pH 7.5, allowing detection of several hundred addi- 
tional spots. Finally, we use nonequilibrium gels with 15% 
acrylamide in the second dimension. These allow visualization 
of about 100 very basic proteins and about 170 small proteins 
(less than 20 kDa). In total, using all three gel systems, about 
1,400 spots can be seen. These represent about 1,200 different 
proteins, which is about one-quarter to one-third of the pro- 
teins expressed under these conditions (27, 30). Here, we focus 
on the proteins seen on the pH 4-8 gels. 

Although nearly all expressed proteins are present on these 
gels, the number seen is limited by a problem we call coverage. 
Since there are thousands of proteins on each gel, many pro- 
teins comigrate or nearly comigrate. When two proteins are 
resolved, but are close together, and one protein spot is much 
more intense than the other, a problem arises in visualizing the 
weaker spot: at long exposures when the weak signal is strong 
enough for detection, the signal from the strong spot spreads 
and covers the signal from the weaker spot. Thus, weak spots 
can be seen only when they are well separated from strong 
spots. 

For a given gel, the number of detectable spots initially rises 
with exposure time. However, beyond an optimal exposure, the 
number of distinguishable spots begins to decrease, because 
signals from strong spots cover signals from nearby weak spots. 
At long exposures, the whole autoradiogram turns black. Thus,
there is an optimum exposure yielding the maximum number 
of spots, and at this exposure the weakest spots are not seen. 

Largely because of the problem of coverage, the proteins 
seen are strongly biased toward abundant proteins. All identi- 
fied proteins have a CAI of 0.18 or more, and we have iden- 
tified no transcription factors or protein kinases, which are 
nonabundant proteins. Thus, this technology is useful for ex- 
amining protein synthesis, amino acid metabolism, and glyco- 
lysis but not for examining transcription, DNA replication, or 
the cell cycle. 

FIG. 1. 2D gels. The horizontal axis is the isoelectric focusing dimension, which stretches from pH 6.7 (left) to pH 4.3 (right). The vertical axis is the polyacrylamide
gel dimension, which stretches from about 15 kDa (bottom) to at least 130 kDa (top). For panel A, extract was made from cells in log phase in glucose; for panel B,
cells were grown in ethanol. The spots labeled 1 through 6 are unidentified proteins highly induced in ethanol.

Spot identification. The identification of various spots has
been described elsewhere (7, 8). At present, 169 different spots 
representing 148 proteins have been identified. Many of these 
spots have been independently identified (2, 10, 23, 25). The 
main methods used in spot identification have been analysis of 
amino acid composition, gene overexpression, peptide se- 
quencing, and mass spectrometry. 

Pulse-chase experiments and protein turnover. Pulse-chase 
experiments were done to measure protein half-lives (Materials
and Methods). Cells were labeled with [35S]methionine for
10 min, and then an excess of unlabeled methionine was added.
Samples were taken at 0, 10, 20, 30, 60, and 90 min after the
beginning of the chase. Equal amounts of 35S were loaded from
each sample; 2D gels were run, and spots were quantitated.
Surprisingly, almost every spot was nearly constant in amount
of radioactivity over the entire time course (not shown). A few
spots shifted from one position to another because of
posttranslational modifications (e.g., phosphorylation of Rpa0 and
Efb1). Thus, the proteins being visualized are all or nearly all
very stable proteins, with half-lives of more than 90 min. Gygi 
et al. (10) have come to a similar conclusion by using the N-end 
rule to predict protein half-lives. This result does not imply 
that all yeast proteins are stable. The proteins being visualized 
are abundant proteins; this is partly because they are stable 
proteins. 

Protein quantitation. Because all of the proteins seen had 
effectively the same half-life, the abundance of each protein 
was directly proportional to the amount of radioactivity incor- 
porated during labeling. Thus, after taking into account the 
total number of protein molecules per cell, the average content 
of methionine and cysteine, and the methionine and cysteine 
content of each identified protein, we could calculate the abun- 
dance of each identified protein (Tables 1 and 2; Materials and 
Methods). About 1,000 unidentified proteins were also quan- 
tified, assuming an average content of Met and Cys. 

Many proteins give multiple spots (7, 8). The contribution 
from each spot was summed to give the total protein amount. 
However, many proteins probably have minor spots that we are 
not aware of, causing the amount of protein to be underesti- 
mated. 

When the proteins on a pH 4-8 gel were ordered by abun- 
dance, the most abundant protein had 8,904 ppm, the 10th 
most abundant had 2,842 ppm, the 100th most abundant had 
314 ppm, the 500th most abundant had 57 ppm, and the 
1,000th most abundant (visualized at greater than optimum 
exposure) had 23 ppm. Thus, there is more than a 300-fold 
range in abundance among the visualized proteins. The most 
abundant 10 proteins account for about 25% of the total pro- 
tein on the pH 4-8 gel, the most abundant 60 proteins account 
for 50%, and the most abundant 500 proteins account for 80%. 
Since it seems likely that the pH 4-8 gels give a representative 
sampling of all proteins, we estimate that half of the total 
cellular protein is accounted for by fewer than 100 different 
gene products, principally glycolytic enzymes and proteins in- 
volved in protein synthesis. 

Correlation of protein abundance with mRNA abundance. 
Estimates of mRNA abundance for each gene have been made 
by SAGE (27) and by hybridization of cRNA to oligonucleo- 
tide arrays (30). These two methods give broadly similar re- 
sults, yet each method has strengths and weaknesses (Materials 
and Methods). Table 1 lists the number of molecules of mRNA 
per cell for each gene studied. One measurement (mRNA) 
uses data from SAGE analysis alone (27); a second incorpo- 
rates data from both SAGE and hybridization (30) (adjusted 
mRNA) (Table 1; Materials and Methods). We correlated 
protein abundance with mRNA abundance (Fig. 2). For adjusted
mRNA versus protein, the Spearman rank correlation
coefficient, r_s, was 0.74 (P < 0.0001), and the Pearson correlation
coefficient, r_p, on log-transformed data (Materials and
Methods) was 0.76 (P < 0.00001). We obtained similar correlations
for mRNA versus protein and also for other data transformations
(Materials and Methods). Thus, several statistical
methods show a strong and significant correlation between 
mRNA abundance and protein abundance. Of course, the cor- 
relation is far from perfect; for mRNAs of a given abundance, 
there is at least a 10-fold range of protein abundance (Fig. 2). 
Some of this scatter is probably due to posttranscriptional 
regulation, and some is due to errors in the mRNA or protein 
data. For example, the protein Yef3 runs poorly on our gels,
giving multiple smeared spots. Its abundance has probably
been underestimated, partly explaining the low protein/mRNA
ratio of Yef3. It is the most extreme outlier in Fig. 2.

These data on mRNA (27, 30) and protein abundance (Ta- 
ble 1) suggest that for each mRNA molecule, there are on 
average 4,000 molecules of the cognate protein. For instance, 
for Act1 (actin) there are about 54 molecules of mRNA per
cell and about 205,000 molecules of protein. Assuming an 
mRNA half-life of 30 min (12) and a cell doubling time of 120 
min, this suggests that an individual molecule of mRNA might 
be translated roughly 1,000 times. These calculations are lim- 
ited to mRNAs for abundant proteins, which are likely to be 
the mRNAs that are translated best. 
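
The numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below is one reading of the estimate, under the assumption that the mRNA pool is replaced roughly (doubling time / half-life) times per generation; the function names are hypothetical.

```python
def proteins_per_mrna(protein_molecules, mrna_molecules):
    """Steady-state ratio of protein molecules to mRNA molecules."""
    return protein_molecules / mrna_molecules


def translations_per_mrna(ratio, mrna_half_life_min, doubling_time_min):
    """Rough number of proteins made from one mRNA molecule in its lifetime.

    Assumes the mRNA pool turns over about doubling_time / half_life times
    while one full complement of protein is made.
    """
    turnovers = doubling_time_min / mrna_half_life_min
    return ratio / turnovers


# Act1 values from the text: 205,000 protein molecules and 54 mRNA molecules.
print(round(proteins_per_mrna(205_000, 54)))
# Average ratio of 4,000 with a 30-min half-life and a 120-min doubling time.
print(round(translations_per_mrna(4000, 30, 120)))
```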

A full complement of cell protein is synthesized in about 120 
min under these conditions. Thus, 4,000 molecules of protein 
per molecule of mRNA implies that translation initiates on an 
mRNA about once every 2 s. This is a remarkably high rate; it 
implies that if an average mRNA bears 10 ribosomes engaged 
in translation, then each ribosome completes translation in
20 s. If an average protein has 450 residues, this in turn implies
translation of over 20 amino acids per s, a rate considerably
higher than estimated for mammals (3 to 8 amino acids per
s) (18). These estimates depend on the amount of mRNA per
cell (11, 27). 
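
The chain of estimates in this paragraph can be reproduced with a short calculation. The inputs are the values stated in the text; the function name is hypothetical. Without rounding, the initiation interval is 1.8 s rather than 2 s, which gives a slightly higher elongation rate than the rounded figures in the text.

```python
def translation_estimates(doubling_time_min, proteins_per_mrna,
                          ribosomes_per_mrna, residues_per_protein):
    """Return (initiation interval in s, transit time in s, residues per s)."""
    interval_s = doubling_time_min * 60 / proteins_per_mrna
    transit_s = interval_s * ribosomes_per_mrna
    return interval_s, transit_s, residues_per_protein / transit_s


interval, transit, rate = translation_estimates(
    doubling_time_min=120,     # time to make a full complement of protein
    proteins_per_mrna=4000,    # average ratio from Table 1
    ribosomes_per_mrna=10,     # assumption stated in the text
    residues_per_protein=450,  # assumption stated in the text
)
print(interval, transit, rate)

# With the interval rounded to 2 s, as in the text:
print(450 / (2 * 10))
```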

The large number of protein molecules that can be made 
from a single mRNA raises the issue of how abundance is 
controlled for less abundant proteins. Many nonabundant pro- 
teins may be unstable, and this would reduce the protein/ 
mRNA ratio. In addition, many nonabundant proteins may be 
translated at suboptimal rates. We have found that mRNAs for 
nonabundant proteins usually have suboptimal contexts for 
translational initiation. For example, there are over 600 yeast 
genes which probably have short open reading frames in the 
mRNA upstream of the main open reading frame (17a). These 
may be devices for reducing the amount of protein made from 
a molecule of mRNA. 

Correlation of codon bias with protein abundance. The 
mRNAs for highly expressed proteins preferentially use some 
codons rather than others specifying the same amino acid (14). 
This preference is called codon bias. The codons preferred are 
those for which the tRNAs are present in the greatest amounts. 
Use of these codons may make translation faster or more 
efficient and may decrease misincorporation. These effects are 
most important for the cell for abundant proteins, and so 
codon bias is most extreme for abundant proteins. The effect 
can be dramatic — highly biased mRNAs may use only 25 of the 
61 codons. 

We asked whether the correlation of codon bias with abun- 
dance continues for medium-abundance proteins. There are 
various mathematical expressions quantifying codon bias; here, 
we have used the CAI (24) (Materials and Methods) because 
it gives a result between 0 and 1. The r_s for CAI versus protein
abundance is 0.80 (P < 0.0001), similar to the mRNA-protein 



TABLE 1. Quantitative data^a

Name      CAI    mRNA    Adjusted mRNA  Protein (Glu) (10^3)  Protein (Eth) (10^3)  E/G ratio

Carbohydrate metabolism
Adh1      0.810  197     197            1,230                 972                   0.79
Adh2      0.504  0                      0                     963                   >20
Cit2      0.185  1       2.8            23                    288                   12
Eno1      0.870  No Nla                 410                   974                   2.4
Eno2      0.892  248     248            650                   215                   0.33
Fba1      0.868  179     179            640                   608                   0.95
Hxk1,2    0.500  13      10.5           62                    46
Icl1      0.251  0                      0                     671                   >20
Pdb1      0.342  5       5              41                    33
Pdc1      0.903  226     226            280                   205                   0.73
Pfk1      0.465  5       5              75                    53                    0.71
Pgi1      0.681  14      14             160                   120                   0.75
Pyc1      0.260  1       0.7            37                    34
Tal1      0.579  5       5              110                   35
Tdh2      0.904  63      63             430                   876                   NR
Tdh3      0.924  460     460            1,670                 1,927                 NR
Tpi1      0.817  No Nla                 No Met                No Met

Protein synthesis
Efb1      0.762  33      16.5           358                   362
Eft1,2    0.801  26      26             99                    54                    0.55
Prt1      0.303  4       0.7            12                    6
Rpa0      0.793  246     246            277                   100                   0.36
Tif1,2    0.752  29      29             233                   106                   0.46
Yef3      0.777  36      36             14                    ND

Heat shock
Hsc82     0.581  2       2.9            112                   75                    0.67
Hsp60     0.381  9       2.3            35                    82                    2.3
Hsp82     0.517  2       1.3            52                    135                   2.6
Hsp104    0.304  7       1              70                    161                   2.3
Kar2      0.439  5       10.1           43                    102                   2.4
Ssa1      0.709  2       4.3            303                   421                   1.4
Ssa2      0.802  10      5              213                   324                   1.5
Ssb1,2    0.850  50      50             270                   85
Ssc1      0.521  2       2.6            68                    80                    1.2
Sse1      0.521  8       8              96                    48
Sti1      0.247  1       1.1            25                    44                    1.7

Amino acid synthesis
Ade1      0.229  4       4              14                    27
Ade3      0.276  2       1.7            12                    9
Ade5,7    0.257  2       1.4            14                    4
Arg4      0.229  1       8.1            41                    41
Gdh1      0.585  10      27             148                   55
Gln1      0.524  11      11             77                    104                   1.3
His4      0.267  3       3              15                    23                    1.5
Ilv5      0.801  6       6              152                   109                   0.7
Lys9      0.332  4       4              32                    17                    0.52
Met6      0.657  No Nla  22             190                   80                    0.42
Pro2      0.248  3       3              30                    12
Ser1      0.258  2       1.2            15                    8
Trp5      0.319  5       5              28                    12

Miscellaneous
Act1      0.710  54      54             205                   164                   0.78
Adk1      0.531  No Nla                 47                    43
Ald6      0.520  3       3              181                   159
Atp2      0.424  1       4.1            76                    109                   1.4
Bmh1      0.322  46      46             191                   137                   0.72
Bmh2      0.384  1       1.4            134                   147
Cdc48     0.306  2       2.4            32                    26
Cdc60     0.299  2       0.86           6                     2
Erg20     0.373  5       5              92                    39
Gpp1      0.603  16      5              234                   158
Gsp1      0.621  3       3              115                   39                    0.34
Ipp1      0.620  4       4              254                   147                   0.58
Lcb1      0.173  0.3     0.8            19                    40
Mol1      0.423  0       0.45           20                    16
Pab1      0.488  3       3              41                    19                    0.47
Psa1      0.600  15      15             148                   56
Rnr4      0.497  6       6              44                    37
Sam1      0.494  5       5              59                    21
Sam2      0.497  3       15             63                    20
Sod1      0.376  36      36             631                   618
Uba1      0.212  2       2              14                    20
YKL056    0.731  62      62             253                   112                   0.44
YLR109    0.549  21      21             930
YMR116    0.777  41      41             184                   40                    0.20

^a CAI, a measure of codon bias, is taken from the YPD. mRNA, number of mRNA molecules per cell from SAGE data (27); adjusted mRNA, number of mRNA
molecules per cell based on both SAGE and chip hybridization (30) (see Materials and Methods); Protein (Glu), number of molecules of protein per cell in
YNB-glucose; Protein (Eth), number of molecules of protein per cell in YNB-ethanol; E/G ratio, ratio of protein abundance in ethanol to glucose. The E/G ratio is
not given if it was close to 1 or if it was not repeatable (NR) in multiple gels. Some gene products (e.g., Tif1 and Tif2 [Tif1,2]) were difficult to distinguish on either
a protein or an mRNA basis; these are pooled. No Nla, there was no suitable NlaIII site in the 3' region of the gene, and so there are no SAGE mRNA data; No Met,
the mature gene product contains no methionines, and so there are no reliable protein data.
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TABLE 2. Functions of proteins listed in Table 1 



Name^a          YPD title lines^b

Adh1            Alcohol dehydrogenase I; cytoplasmic isozyme reducing acetaldehyde to ethanol, regenerating NAD+
Adh2            Alcohol dehydrogenase II; oxidizes ethanol to acetaldehyde, glucose repressed
Cit2            Citrate synthase, peroxisomal (nonmitochondrial); converts acetyl-CoA and oxaloacetate to citrate plus CoA
Eno1            Enolase 1 (2-phosphoglycerate dehydratase); converts 2-phospho-D-glycerate to phosphoenolpyruvate in glycolysis
Eno2            Enolase 2 (2-phosphoglycerate dehydratase); converts 2-phospho-D-glycerate to phosphoenolpyruvate in glycolysis
Fba1            Fructose bisphosphate aldolase II; sixth step in glycolysis
Hxk1            Hexokinase I; converts hexoses to hexose phosphates in glycolysis; repressed by glucose
Hxk2            Hexokinase II; converts hexoses to hexose phosphates in glycolysis and plays a regulatory role in glucose repression
Icl1            Isocitrate lyase, peroxisomal; carries out part of the glyoxylate cycle; required for gluconeogenesis
Pdb1            Pyruvate dehydrogenase complex, E1 beta subunit
Pdc1            Pyruvate decarboxylase isozyme 1
Pfk1            Phosphofructokinase alpha subunit, part of a complex with Pfk2p which carries out a key regulatory step in glycolysis
Pgi1            Glucose-6-phosphate isomerase, converts glucose-6-phosphate to fructose-6-phosphate
Pyc1            Pyruvate carboxylase 1; converts pyruvate to oxaloacetate for gluconeogenesis
Tal1            Transaldolase; component of nonoxidative part of pentose phosphate pathway
Tdh2            Glyceraldehyde-3-phosphate dehydrogenase 2; converts D-glyceraldehyde 3-phosphate to 1,3-diphosphoglycerate
Tdh3            Glyceraldehyde-3-phosphate dehydrogenase 3; converts D-glyceraldehyde 3-phosphate to 1,3-diphosphoglycerate
Tpi1            Triosephosphate isomerase; interconverts glyceraldehyde-3-phosphate and dihydroxyacetone phosphate

Efb1            Translation elongation factor EF-1β; GDP/GTP exchange factor for Tef1p/Tef2p
Eft1            Translation elongation factor EF-2; contains diphthamide which is not essential for activity; identical to Eft2p
Eft2            Translation elongation factor EF-2; contains diphthamide which is not essential for activity; identical to Eft1p
Prt1            Translation initiation factor eIF3 beta subunit (p90); has an RNA recognition domain
Rpa0 (RPP0)     Acidic ribosomal protein A0
Tif1            Translation initiation factor 4A (eIF4A) of the DEAD box family
Tif2            Translation initiation factor 4A (eIF4A) of the DEAD box family
Yef3            Translation elongation factor EF-3A; member of ATP-binding cassette superfamily

Hsc82           Chaperonin homologous to E. coli HtpG and mammalian HSP90
Hsp60           Mitochondrial chaperonin that cooperates with Hsp10p; homolog of E. coli GroEL
Hsp82           Heat-inducible chaperonin homologous to E. coli HtpG and mammalian HSP90
Hsp104          Heat shock protein required for induced thermotolerance and for resolubilizing aggregates of denatured proteins; important for [psi-]-to-[PSI+] prion conversion
Kar2            Heat shock protein of the endoplasmic reticulum lumen required for protein translocation across the endoplasmic reticulum membrane and for nuclear fusion; member of the HSP70 family
Ssa1            Cytoplasmic chaperone; heat shock protein of the HSP70 family
Ssa2            Cytoplasmic chaperone; member of the HSP70 family
Ssb1            Heat shock protein of HSP70 family involved in the translational apparatus
Ssb2            Heat shock protein of HSP70 family, cytoplasmic
Ssc1            Mitochondrial protein that acts as an import motor with Tim44p and plays a chaperonin role in receiving and folding of protein chains during import; heat shock protein of HSP70 family
Sse1            Heat shock protein of the HSP70 family; multicopy suppressor of mutants with hyperactivated Ras/cyclic AMP pathway
Sti1            Stress-induced protein required for optimal growth at high and low temperature; has tetratricopeptide repeats

Ade1            Phosphoribosylamidoimidazole-succinocarboxamide synthase; catalyzes the seventh step in de novo purine biosynthesis pathway
Ade3            C1-tetrahydrofolate synthase (trifunctional enzyme), cytoplasmic
Ade5,7          Phosphoribosylamine-glycine ligase plus phosphoribosylformylglycinamidine cyclo-ligase; bifunctional protein
Arg4            Argininosuccinate lyase; catalyzes the final step in arginine biosynthesis
Gdh1            Glutamate dehydrogenase (NADP+); combines ammonia and α-ketoglutarate to form glutamate
Gln1            Glutamine synthetase; combines ammonia to glutamate in ATP-driven reaction
His4            Phosphoribosyl-AMP cyclohydrolase/phosphoribosyl-ATP pyrophosphohydrolase/histidinol dehydrogenase; 2nd, 3rd, and 10th steps of his biosynthesis pathway
Ilv5            Ketol-acid reductoisomerase (acetohydroxy acid reductoisomerase) (alpha-keto-β-hydroxylacyl reductoisomerase); second step in Val and Ile biosynthesis pathway
Lys9            Saccharopine dehydrogenase (NADP+, L-glutamate forming) (saccharopine reductase), seventh step in lysine biosynthesis pathway
Met6            Homocysteine methyltransferase (5-methyltetrahydropteroyl triglutamate-homocysteine methyltransferase), methionine synthase, cobalamin independent
Pro2            γ-Glutamyl phosphate reductase (phosphoglutamate dehydrogenase), proline biosynthetic enzyme
Ser1            Phosphoserine transaminase; involved in synthesis of serine from 3-phosphoglycerate
Trp5            Tryptophan synthase, last (5th) step in tryptophan biosynthesis pathway

Act1            Actin; involved in cell polarization, endocytosis, and other cytoskeletal functions
Adk1            Adenylate kinase (GTP:AMP phosphotransferase), cytoplasmic
Ald6            Cytosolic acetaldehyde dehydrogenase
Atp2            Beta subunit of F1-ATP synthase; 3 copies are found in each F1 oligomer
Bmh1            Homolog of mammalian 14-3-3 protein; has strong similarity to Bmh2p
Bmh2            Homolog of mammalian 14-3-3 protein; has strong similarity to Bmh1p
Cdc48           Protein of the AAA family of ATPases; required for cell division and homotypic membrane fusion
Cdc60           Leucyl-tRNA synthetase, cytoplasmic
Erg20           Farnesyl pyrophosphate synthetase; may be rate-limiting step in sterol biosynthesis pathway
Gpp1 (Rhr2)     DL-Glycerol phosphate phosphatase
Gsp1            Ran, a GTP-binding protein of the Ras superfamily involved in trafficking through nuclear pores
[name missing]  Inorganic pyrophosphatase, cytoplasmic
[name missing]  Component of serine C-palmitoyltransferase; first step in biosynthesis of long-chain base component of sphingolipids
Mol1 (Thi4)     Thiamine-repressed protein essential for growth in the absence of thiamine
Pab1            Poly(A)-binding protein of cytoplasm and nucleus; part of the 3'-end RNA-processing complex (cleavage factor I); has 4 RNA recognition domains
Psa1            Mannose-1-phosphate guanyltransferase; GDP-mannose pyrophosphorylase
Rnr4            Ribonucleotide reductase small subunit
Sam1            S-Adenosylmethionine synthetase 1
Sam2            S-Adenosylmethionine synthetase 2
Sod1            Copper-zinc superoxide dismutase
Uba1            Ubiquitin-activating (E1) enzyme

YKL056          Resembles translationally controlled tumor protein of animal cells and higher plants
YLR109 (Ahp1)   Alkyl hydroperoxide reductase
YMR116 (Asc1)   Abundant protein with effects on translational efficiency and cell size, has two WD (WD-40) repeats

^a Accepted name from the Saccharomyces Genome Database and YPD. Names in parentheses represent recent changes. 
^b Courtesy of Proteome, Inc., reprinted with permission. 
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FIG. 2. Correlation of protein abundance with adjusted mRNA abundance. 
The number of molecules per cell of each protein is plotted against the number 
of molecules per cell of the cognate mRNA, with an r_p of 0.76. Note the 
logarithmic axes. Data for mRNA were taken from references 27 and 30 and 
combined as described in Materials and Methods. 



correlation, confirming a strong correlation between CAI and 
protein abundance (Fig. 3). The relationship between CAI and 
protein abundance is log linear from about 1,000,000 to about 
10,000 molecules per cell. We have no data for rarer proteins. 

It is not clear whether CAI reflects maximum or average 
levels of protein expression. The proteins used for the CAI- 
protein correlation included some proteins which were not 
expressed at maximum levels under the condition of the experiment 
(Hsc82, Hsp104, Ssa1, Ade1, Arg4, His4, and others). 
When these proteins were removed from consideration and 
the correlation between CAI and the remaining (presumably 
constitutive) proteins was recalculated, the r_s was essentially 
unchanged (not shown). 

The equation describing the graph in Fig. 3 is log (protein 
molecules/cell) = (2.3 × CAI) + 3.7. Thus, under certain 
conditions (a CAI of 0.3 or greater; a constitutively expressed 
gene), a very rough estimate of protein abundance can be 
made by raising 10 to the power of [(2.3 × CAI) + 3.7]. 
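
The rule of thumb from the Fig. 3 fit can be written as a short function. This is a minimal sketch rather than code from the study: the function name and the guard that rejects a CAI below 0.3 are illustrative assumptions, and the logarithm is taken to be base 10, as implied by "raising 10 to the power of".

```python
def estimate_protein_per_cell(cai: float) -> float:
    """Very rough protein abundance (molecules per cell) from the Fig. 3 fit.

    Assumes log10(protein molecules/cell) = (2.3 x CAI) + 3.7 and a
    constitutively expressed gene; the name and the guard are illustrative.
    """
    if cai < 0.3:
        # The text gives no data supporting the fit below a CAI of about 0.3.
        raise ValueError("fit is only supported for a CAI of 0.3 or greater")
    return 10 ** (2.3 * cai + 3.7)


if __name__ == "__main__":
    for cai in (0.3, 0.6, 0.9):
        estimate = estimate_protein_per_cell(cai)
        print(f"CAI {cai:.1f}: about {estimate:,.0f} molecules per cell")
```

Under these assumptions a CAI of 0.3 gives roughly 25,000 molecules per cell and a CAI of 0.9 roughly 590,000, which falls inside the 10,000 to 1,000,000 range over which the relationship was observed.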

The distribution of CAI over the genome (Fig. 4) consists of 
a lower, bell-shaped distribution, possibly indicating a region 
where there is no selection for codon bias, and an upper, flat 
distribution, starting at a CAI of about 0.3, possibly indicating 
a region where there is selection for codon bias. Almost all of 
the proteins whose abundance we have measured are in the 
upper, flat portion of the distribution. In the lower, bell-shaped 
region, we do not know whether there is a correlation between 
CAI and protein abundance. 

Changes in protein abundance in glucose and ethanol. A 
comparison of cells grown in glucose (Fig. 1A) with cells grown 
in ethanol (Fig. 1B) is shown in Table 1. As is well known, 
some proteins are induced tremendously during growth on 
ethanol. Two striking examples are the peroxisomal enzymes 
Icl1 (isocitrate lyase) and Cit2 (citrate synthase), which are 
induced in ethanol by more than 100- and 12-fold, respectively 
(Fig. 1; Table 1). These enzymes are key components of the 
glyoxylate shunt, which diverts some acetyl coenzyme A 
(acetyl-CoA) from the tricarboxylic acid cycle to gluconeogenesis. 
S. cerevisiae requires large amounts of carbohydrate for its 
cell wall; in ethanol medium, this carbohydrate comes from 
gluconeogenesis, which depends on the glyoxylate shunt and 
on the glycolytic pathway running in reverse. The need for 



gluconeogenesis also explains why glycolytic enzymes are 
abundant even in ethanol medium. Thus, 2D gel analysis shows 
the prominence of the glycolytic and glyoxylate shunt enzymes 
in cells grown on ethanol, emphasizing that gluconeogenesis, 
presumably largely for production of the cell wall, is a major 
metabolic activity under these conditions. 

During gluconeogenesis, substrate-product relationships are 
reversed for the glycolytic enzymes. One might expect that not 
all glycolytic enzymes would be well adapted to the reverse 
reaction. Indeed, 2D gels show that in ethanol, Adh2 (alcohol 
dehydrogenase 2) is strongly induced (16), while its isozyme 
Adh1 is not greatly affected. Adh1 and Adh2 each interconvert 
acetaldehyde and ethanol. Adh1 has a relatively high K_m for 
ethanol (17 mM), while Adh2 has a lower K_m (0.8 mM) (5). 
Thus, it is thought that Adh1 is specialized for glycolysis 
(acetaldehyde to ethanol), while Adh2 is specialized for 
respiration (ethanol to acetaldehyde) (5, 29). Similarly, Eno1 (enolase 
1) is induced in ethanol, while its isozyme Eno2 (enolase 2) 
decreases in abundance (Table 1) (4, 19). Eno1 is inhibited by 
2-phosphoglycerate (the glycolytic substrate), while Eno2 is 
inhibited by phosphoenolpyruvate (the gluconeogenic 
substrate) (4). Perhaps Eno1 has a lower K_m for 
phosphoenolpyruvate than does Eno2, though to our knowledge this has not 
been tested. Thus, the 2D gels distinguish isozymes specialized 
for growth on glucose (Adh1 and Eno2) from isozymes 
specialized for ethanol (Adh2 and Eno1). 
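
The practical meaning of the two K_m values for ethanol can be illustrated with the standard Michaelis-Menten relation, v/Vmax = [S]/(K_m + [S]). The sketch below is an illustration only: it assumes simple Michaelis-Menten kinetics, and the ethanol concentration of 1 mM is a hypothetical value chosen for the example, not a measurement from this study. It makes no claim about the relative Vmax of the two isozymes.

```python
def fractional_saturation(substrate_mM: float, km_mM: float) -> float:
    """Fraction of Vmax under Michaelis-Menten kinetics: [S] / (K_m + [S])."""
    return substrate_mM / (km_mM + substrate_mM)


# K_m values for ethanol as quoted in the text (reference 5).
KM_ETHANOL_MM = {"Adh1": 17.0, "Adh2": 0.8}

# Hypothetical ethanol concentration, chosen only for illustration.
ETHANOL_MM = 1.0

if __name__ == "__main__":
    for isozyme, km in KM_ETHANOL_MM.items():
        fraction = fractional_saturation(ETHANOL_MM, km)
        print(f"{isozyme}: {fraction:.2f} of Vmax at {ETHANOL_MM} mM ethanol")
```

At this assumed concentration Adh2 would run at about 0.56 of its Vmax and Adh1 at about 0.06, which is the sense in which the lower K_m suits Adh2 to oxidizing ethanol when ethanol is scarce.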

Many heat shock proteins (e.g., Hsp60, Hsp82, Hsp104, and 
Kar2) were about twofold more abundant in ethanol medium 
than in glucose medium. This is consistent with the increased 
heat resistance of cells grown in ethanol (3). 

Enzymes involved in protein synthesis (Eft1, Rpa0, and Tif1) 
were about twice as abundant in glucose medium as in ethanol 
medium. This may reflect the higher growth rate of the cells in 
glucose. 

Phosphorylation of proteins. To examine protein phosphorylation, 
we labeled cells with 32P and ran 2D gels to examine 
phosphoproteins. About 300 distinct spots, probably representing 
150 to 200 proteins, could be seen on pH 4-8 gels (Fig. 5B). 
We then aligned autoradiograms of three gels, each with a 
different kind of labeled protein (32P only [Fig. 5B], 32P plus 
35S [Fig. 5A], and 35S only [not shown, but see Fig. 1 for 
example]). In this way, we made provisional identification of 
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FIG. 3. Correlation of protein abundance with CAI. The number of mole- 
cules per cell of each protein is plotted against the CAI for that protein. Note the 
logarithmic scale on the protein axis. Data for the CAI are from the YPD 
database (13). 



7364 FUTCHER ET AL. 



Mol. Cell. Biol. 



[Histogram for Fig. 4: x axis, Codon Adaptation Index (ticks from .12 to .84); y axis ticks at 1000 and 2000.] 



FIG. 4. Distribution of CAI over the whole genome, shown in intervals of 0.030 (i.e., there are 150 genes with a CAI between 0.000 and 0.030, inclusive; 31 genes 
with a CAI between 0.031 and 0.060; 269 genes with a CAI between 0.061 and 0.090; 1,296 genes with a CAI between 0.091 and 0.120; etc). The distribution peaks 
with 2,028 genes with a CAI between 0.121 and 0.150. 



some of the 32P-labeled spots as particular 35S-labeled spots. 
All such identifications are somewhat uncertain, since precise 
alignments are difficult, and of course multiple spots may 
exactly comigrate. Nevertheless, we believe that most of the 
provisional identifications are probably correct. Among the 
major 32P-labeled proteins are the hexokinases Hxk1 and 
Hxk2, the acidic ribosome-associated protein Rpa0, the 
translation factors Yef3 and Efb1, and probably Hsp70 heat shock 
proteins of the Ssa and Ssb families. Rpa0 and Efb1 are 
quantitatively monophosphorylated. 

Many yeast proteins resolve into multiple spots on these 2D 
gels (7). Yef3 has five or more spots, at least four of which 
comigrate with 32P. Tpi1 has a major spot showing no 32P 
labeling and a minor, more acidic spot which overlaps with 
some 32P label. Tif1 has at least seven spots (7); two of these 
overlap with some 32P label, but five do not (Fig. 5). Eft1 has 
at least three spots (7), and none of these overlap with 32P, 
although there are three nearby, unidentified 32P-labeled spots 
(a, c, and d in Fig. 5). Spots that seem to be extra forms of 
Met6, Pdc1, Eno2, and Fba1 can be seen in Fig. 6A, but there 
is little 32P at these positions in Fig. 5. Thus, phosphorylation 
explains some but not all of the different protein isoforms seen. 

The cell cycle is regulated in part by phosphorylation. We 
compared 32P-labeled proteins from cells synchronized in G1 
with α-factor, from cells synchronized in G1 by depletion of G1 
cyclins, and from cells synchronized in M phase with nocodazole. 
Only very minor differences were seen, and these were difficult 
to reproduce. The cell cycle proteins regulated by phosphorylation 
may not be abundant enough for this technique to be 
applied easily. 

Centrifugal fractionation. We fractionated 35S-labeled extracts 
by centrifugation (Materials and Methods). Figure 6A 
shows the proteins in the supernatant of a high-speed 
(100,000 × g, 30 min) centrifugation, while Fig. 6B shows the 
proteins in the pellet of a low-speed (16,000 × g, 10 min) 
centrifugation. Many proteins are tremendously enriched in 
one fraction or the other, while others are present in both. 



Most glycolytic enzymes (e.g., Tdh2, Tdh3, Eno2, Pdc1, Adh1, 
and Fba1) are enriched in the supernatant fraction. The only 
exception is Pfk1 (not indicated), which is found in both pellet 
and supernatant fractions. Many proteins involved in protein 
synthesis (Eft1, Yef3, Prt1, Tif1, and Rpa0) are in the pellet, 
possibly because of the association of ribosomes with the 
endoplasmic reticulum. However, Efb1 is in the supernatant, as is 
a substantial portion of the Eft1. Perhaps surprisingly, several 
mitochondrial proteins (Atp2 [not shown] and Ilv5) are largely 
in the supernatant. Perhaps glass bead breakage of cells 
releases mitochondrial proteins. The nuclear protein Gsp1 is in 
the pellet fraction. The enrichment produced by centrifugation 
makes it possible to see minor spots which are otherwise poorly 
resolved from surrounding proteins. Figure 6B shows that the 
previously identified Tif1 spot is surrounded by as many as six 
other spots that cofractionate. We observed six identical or 
very similar additional spots when we overexpressed Tif1 from 
a high-copy-number plasmid (not shown). Signal overlaps only 
one or two of these spots in 32P-labeling experiments (Fig. 5), 
and so the different forms are not mainly due to different 
phosphorylation states. 

DISCUSSION 

Our experience with developing a 2D gel protein database 
for S. cerevisiae is summarized here. With current technology, 
we can see the most abundant 1,200 proteins, which is about 
one-third to one-quarter of the proteins expressed. The re- 
maining proteins will be difficult to see and study with the 
methods that we have used, not because of a lack of sensitivity 
but because weak spots are covered by nearby strong spots. 

Of the 1,200 proteins seen, we have identified 148, with a 
bias toward the most abundant proteins. Steady application of 
the methods already used would allow identification of most of 
the remaining proteins. Gene overexpression will be particu- 
larly useful, since it is not affected by the lower abundance of 
the remaining visible proteins. 
FIG. 5. Phosphorylated proteins. (A) Mixture of 32P-labeled proteins and 35S-labeled proteins. Two separate labeling reactions were done, one with 32P and one 
with 35S, and extracts were mixed and run on a 2D gel. Spots marked with numbers rather than gene names represent spots noted on 35S gels but unidentified. Spots 
labeling with 32P were identified by (i) increased labeling compared to the 35S-only gel (not shown); (ii) the characteristic fuzziness of a 32P-labeled spot; and (iii) the 
decay of signal intensity seen on exposures made 4 weeks later (not shown). A minor form of Tpi1 and at least six minor forms of Tif1 have been noted in overexpression 
experiments (see also Fig. 6B); positions of the minor forms are indicated by circles. (B) 32P-only labeling. The major form of Tpi1, which is not labeled with 32P, is 
indicated by a large circle; positions of seven forms of Tif1 are indicated by smaller circles. 
FIG. 6. Fractionation by centrifugation. (A) Proteins in the supernatant of a 100,000 × g, 30-min spin; (B) proteins in the pellet of a 16,000 × g, 10-min spin. Supernatant 
fractions examined in multiple experiments done over a wide range of g forces looked similar to each other, as did the pellet fractions. 
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2D gels of the kind that we have used are not suitable for 
visualization of rare proteins. However, it will be possible to 
study on a global basis metabolic processes involving relatively 
abundant proteins, such as protein synthesis, glycolysis, glu- 
coneogenesis, amino acid synthesis, cell wall synthesis, nucle- 
otide synthesis, lipid metabolism, and the heat shock response. 

Gygi et al. (10) have recently completed a study similar to 
ours. Despite generating broadly similar data, Gygi et al. 
reached markedly different conclusions. We believe that both 
mRNA abundance and codon bias are useful predictors of 
protein abundance. However, Gygi et al. feel that mRNA 
abundance is a poor predictor of protein abundance and that 
"codon bias is not a predictor of either protein or mRNA 
levels" (10). These different conclusions are partly a matter of 
viewpoint. Gygi et al. focus on the fact that the correlations of 
mRNA and codon bias with protein abundance are far from 
perfect, while we focus on the fact that, considering the wide 
range of mRNA and protein abundance and the undoubted 
presence of other mechanisms affecting protein abundance, 
the correlations are quite good. 

However, the different conclusions are also partly due to 
different methods of statistical analysis and to real differences 
in data. With respect to statistics, Gygi et al. used the Pearson 
product-moment correlation coefficient (r_p) to measure the 
covariance of mRNA and protein abundance. Depending on 
the subset of data included, their r_p values ranged from 0.1 to 
0.94. Because of the low r_p values with some subsets of the 
data, Gygi et al. concluded that the correlation of mRNA to 
protein was poor. However, the r_p correlation is a parametric 
statistic and so requires variates following a bivariate normal 
distribution; that is, it would be valid only if both mRNA and 
protein abundances were normally distributed. In fact, both 
distributions are very far from normal (data not shown), and so 
a calculation of r_p is inappropriate. There was no statistical 
backing for the assertion that codon bias fails to predict pro- 
tein abundance. 

We have taken two statistical approaches. First, we have 
used the Spearman rank correlation coefficient (r_s). Since this 
statistic is nonparametric, there is no requirement for the data 
to be normally distributed. Using the r_s, we find that mRNA 
abundance is well correlated with protein abundance (r_s = 
0.74), and the CAI is also well correlated with protein 
abundance (r_s = 0.80) (and also with mRNA abundance [data not 
shown]). For the data of Gygi et al. (10), we obtained similar 
results, though with their data the correlation is not as good; r_s 
= 0.59 for the mRNA-to-protein correlation, and r_s = 0.59 for 
the codon bias-to-protein correlation. 
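
The rank-based calculation can be sketched in a few lines. This is an illustrative implementation, not the software used in the study: the function names are hypothetical, ties are given the mean of their ranks (a common convention that the text does not specify), and the abundance values are made up for the example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman r_s, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up abundances (molecules per cell), for illustration only.
mrna = [0.5, 2, 10, 40, 150]
protein = [3000, 9000, 20000, 400000, 350000]

if __name__ == "__main__":
    print(round(spearman(mrna, protein), 2))  # prints 0.9
```

Because only the ranks enter the calculation, any monotonic rescaling of either variable (for example, taking logarithms) leaves r_s unchanged, which is why no assumption about the shape of the distributions is needed.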

In a second approach, we transformed the mRNA and protein 
data to forms where they were normally distributed, to 
allow calculation of an r_p (Materials and Methods). Two 
transformations, Box-Cox and logarithmic, were used; both gave 
good correlations with our data [e.g., r_p = 0.76 for log(adjusted 
RNA) to log(protein)]. We were not able to transform the data 
of Gygi et al. to a normal distribution. 
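
The logarithmic version of this second approach can be sketched as follows. This is a minimal illustration with made-up values, not the analysis described in Materials and Methods: the function names are hypothetical, base-10 logarithms are an assumption, no test of normality is performed, and the Box-Cox transformation is not shown.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_pearson(x, y):
    """r_p computed after a log10 transformation of both variables."""
    return pearson([math.log10(v) for v in x], [math.log10(v) for v in y])


# Made-up abundances (molecules per cell), for illustration only.
mrna = [0.5, 2.0, 10.0, 40.0, 150.0]
protein = [3e3, 9e3, 2e4, 4e5, 3.5e5]

if __name__ == "__main__":
    print("r_p on raw values:", round(pearson(mrna, protein), 2))
    print("r_p on log values:", round(log_pearson(mrna, protein), 2))
```

On a log scale a fixed protein-to-mRNA ratio appears as a straight line, so data that followed such a ratio exactly would give a log-log r_p of 1 however wide the range of abundances.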

Finally, there are also some differences in data between the 
two studies. These may be partly due to the different measure- 
ment techniques used: Gygi et al. measured protein abundance 
by cutting spots out of gels and measuring the radioactivity in 
each spot by scintillation counting, whereas we used phospho- 
rimaging of intact gels coupled to image analysis. We com- 
pared our data to theirs for the proteins common between the 
studies (but excluding proteins whose mRNAs are known to 
differ between rich and minimal media, and excluding Tif1, 
which was anomalous in differing by 100-fold between the two 
data sets). The r_s between the two protein data sets was 0.88 
(P < 0.0001). Although this is a strong correlation, the fact that 



it is less than 1.0 suggests that there may have been errors in 
measuring protein abundance in one or both studies. After 
normalizing the two data sets to assume the same amount of 
protein per cell, we found a systematic tendency for the protein 
abundance data of Gygi et al. to be slightly higher than ours for 
the highest-abundance proteins and also for the lowest-abun- 
dance proteins but slightly lower than ours for the middle- 
abundance proteins. These systematic differences suggest some 
systematic errors in protein measurement. Although we do not 
know what the errors are, we suggest the following as a rea- 
sonable speculation. For the highest-abundance proteins, we 
may have underestimated the amount of protein because of a 
slightly nonlinear response of the phosphorimager screens. For 
the lowest-abundance proteins, Gygi et al. may have overesti- 
mated the amount of protein because of difficulties in accu- 
rately cutting very small spots out of the gel and because of 
difficulties in background subtraction for these small, weak 
spots. The difference in the middle-abundance proteins may be 
a consequence of normalization, given the two errors above. 

The low-abundance proteins in the data set of Gygi et al. 
have a poor correlation with mRNA abundance. We calculate 
that the r_s is 0.74 for the top 54 proteins of Gygi et al. but only 
0.22 for the bottom 53 proteins, a statistically significant 
difference. However, with our data set, the r_s is 0.62 for the top 33 
proteins and 0.56 (not significantly different) for the bottom 33 
proteins (which are comparable in abundance to the bottom 
53 proteins of Gygi et al.). Thus, our data set maintains a good 
correlation between mRNA and protein abundance even at 
low protein abundance. This is consistent with our speculation 
that protein quantification by phosphorimaging and image 
analysis may be more accurate for small, weak spots than is 
cutting out spots followed by scintillation counting. Our rela- 
tively good correlations even for nonabundant proteins may 
also reflect the fact that we used both SAGE data and RNA 
hybridization data, which is most helpful for the least abundant 
mRNAs. In summary, we feel that the poor correlation of 
protein to mRNA for the nonabundant proteins of Gygi et al. 
may reflect difficulty in accurately measuring these nonabun- 
dant proteins and mRNAs, rather than indicating a truly poor 
correlation in vivo. It is not surprising that observed correla- 
tions would be poorer with less-abundant proteins and 
mRNAs, simply because the accuracy of measurement would 
be worse. 

How well can mRNA abundance predict protein abundance? 
With r_p = 0.76 for logarithmically transformed mRNA 
and protein data, the coefficient of determination, (r_p)^2, is 0.58. 
This means that more than half (in log space) of the variation 
in protein abundance is explained by variation in mRNA abundance. 
When converted back to arithmetic values, protein 
abundances vary over about 200-fold (Table 1), and (r_p)^2 = 
0.58 for the log data means that of this 200-fold variation, 
about 20-fold is explained by variation in the abundance of 
mRNA and about 10-fold is unexplained (but could be due 
partly to measurement errors). For proteins much less abun- 
dant than those considered here, we imagine the in vivo cor- 
relation between mRNA and protein abundance will be worse, 
and other regulatory mechanisms such as protein turnover will 
be more important. 
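
The 20-fold and 10-fold figures can be reproduced by splitting the 200-fold range in log space. The sketch below shows one reading of that arithmetic; the text does not spell out the calculation, so the partition used here (explained fold = 200 raised to the power (r_p)^2, unexplained fold = 200 raised to the power 1 - (r_p)^2) is an assumption, and the variable names are illustrative.

```python
import math

r_p = 0.76                     # Pearson correlation on log-transformed data
r_squared = r_p ** 2           # coefficient of determination, about 0.58
total_fold = 200.0             # approximate range of protein abundance (Table 1)

# Split the range in log space, then convert back to arithmetic fold values.
log_range = math.log10(total_fold)
explained_fold = 10 ** (r_squared * log_range)
unexplained_fold = 10 ** ((1 - r_squared) * log_range)

if __name__ == "__main__":
    print(f"(r_p)^2 = {r_squared:.2f}")
    print(f"explained by mRNA: about {explained_fold:.0f}-fold")
    print(f"unexplained: about {unexplained_fold:.0f}-fold")
```

Under this reading the two factors multiply back to the full 200-fold range, giving roughly 21-fold explained and 9-fold unexplained, in line with the rounded values quoted above.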

Some important conclusions can be drawn from this sam- 
pling of the proteome. First, there is an enormous range of 
protein abundance, from nearly 2,000,000 molecules per cell 
for some glycolytic enzymes to about 100 per cell for some cell 
cycle proteins (26a). Second, about half of all cellular protein 
is found in fewer than 100 different gene products, which are 
mostly involved in carbohydrate metabolism or protein synthe- 
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sis. Third, the correlation between protein abundance and CAI 
is log linear as far as we can see, which is from about 10,000 
protein molecules per cell to about 1,000,000. This is somewhat 
surprising, because it implies that selective forces for codon 
bias are significant even at moderate expression levels. It also 
means that codon bias is a useful predictor of protein abun- 
dance even for moderately low bias proteins. Fourth, there is a 
good correlation between protein abundance and mRNA 
abundance for the proteins that we have studied. This validates 
the use of mRNA abundance as a rough predictor of protein 
abundance, at least for relatively abundant proteins. Fifth, for 
these abundant proteins, there are about 4,000 molecules of 
protein for each molecule of mRNA. This last conclusion 
raises questions as to how the levels of nonabundant proteins 
are regulated and suggests that protein instability, regulated 
translation, suboptimal rates of translation, and other mecha- 
nisms in addition to transcriptional control may be very impor- 
tant for these proteins. 
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